Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

9157 kw search update backfill folders for GitHub #9260

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

overmode
Copy link
Contributor

@overmode overmode commented Dec 10, 2024

Description

  • Add node_folder upserts and deletes
  • Migration script

Risk

Break Github connection, lose synced Github data

Tests

  • created toy repo, triggered sync with bash admin/cli.sh github code-sync
    -> Made sure that the state is fine (correct parents, correct folder created/deleted) once the sync is triggered. Test cases include creating/deleting/updating issue, discussion, folder, repo
    ---> monitor data_source_folders, data_source_nodes, and github tables (github_code_directories, github_code_files, github_code_repositories, ...)

For testing the backfill, add connection with main branch (no folder_node), then checkout and trigger sync -> check tables listed above.

Once everything works, do it with the Dust repo locally (because bigger)

Then we're ok to merge

Deploy Plan

@overmode overmode linked an issue Dec 10, 2024 that may be closed by this pull request
@overmode overmode force-pushed the 9157-kw-search-update-backfill-folders-for-github branch from d26fe4b to d931d7b Compare December 10, 2024 17:36
@overmode overmode marked this pull request as ready for review December 11, 2024 09:38
connectors/src/connectors/github/lib/hierarchy.ts Outdated Show resolved Hide resolved
connectors/src/connectors/github/lib/hierarchy.ts Outdated Show resolved Hide resolved
Comment on lines 1211 to 1216
await upsertFolderNode({
dataSourceConfig,
folderId: d.internalId,
parents: [d.internalId, ...d.parents, repoId.toString()],
title: d.dirName,
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are missing the intermediate "github-code-" thing, which does not exist in connectors database, but does exists in d.parents.
I also did not check the issues/discussion structure but i think we also have intermediate virtual folders here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(no structure for issues, it's just a big database entry)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new commit handles the "Code", "Issues" and "Discussions" nodes

Copy link
Contributor

@philipperolet philipperolet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👋 having a look, providing here first elements about PR description:

  • for deploy: here you would write that you deploy connectors, then run the backfill (and in general any other steps, such as monitoring X or Y for risky deploys, etc.)
  • on risk: since pretty risky, can you share the local tests you made here to ensure this works well? should include checking after various inserts / deletes, & running the backfill

Copy link
Contributor

@philipperolet philipperolet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks directionally good :)
Left a few coms
Also, can you explain how you detected all the places where we needed to upsert / delete folders?

await deleteFolderNode({
dataSourceConfig,
folderId: repoId,
loggerArgs: logger.bindings(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 points here:

  1. what led you to pass logger.bindings here? not sure what this does
  2. in general, this argument is now not necessary, you can remove it from the call and even delete it from deleteFolderNode

Copy link
Contributor Author

@overmode overmode Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added it because it is used in upsertDataSource. Indeed currently not used in deleteFolderNode (although it is a parameter), but thought that if we start using it, at least we don't have to modify this call to pass it. I'm ok with deleting it 👍

dataSourceConfig,
folderId: githubCodeRepository.repoId,
title: githubCodeRepository.repoName,
parents: [githubCodeRepository.repoId],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't forget, parents[0] === folder_id as you changed in connectors/src/connectors/github/lib/hierarchy.ts

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case we are uploading a node for the repository itself, which is at root level, that's why it is its own id

@overmode overmode requested a review from tdraier December 11, 2024 14:49
@philipperolet
Copy link
Contributor

As discussed IRL, we wait for https://github.com/dust-tt/tasks/issues/1785 to be done here, in order to avoid having to migrate folders too -- and we update PR accordingly

Copy link
Contributor

@philipperolet philipperolet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick question: how is code adressed here? we'd need to update folders for code too right?
image

@overmode
Copy link
Contributor Author

overmode commented Dec 12, 2024

We create the following data_source_folders:

  • 1 for the Code directory named 'Code' with id 'github-code-${repoId}'
  • 1 for each folder GithubCodeDirectory and make sure that the 'Code' directory is a parent.
  • 1 for the repository -> Top of the hierarchy

This is for Code Sync, but we also handle Issues and discussions via the :

  • Issues folder -> Parent of all issues, not a child of Code, child of Repo
  • Discussions folder -> Parent of all discussions, not a child of Code, child of Repo

@philipperolet
Copy link
Contributor

Thanks ! Very clear 👍

We create the following data_source_folders:

  • 1 for the Code directory named 'Code' with id 'github-code-${repoId}'
  • 1 for each folder GithubCodeDirectory and make sure that the 'Code' directory is a parent.

@overmode
Copy link
Contributor Author

Note: handling of the GithubConnectorStrategy.delete function is missing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[KW-Search] Update & backfill folders for github
3 participants